The following report analyzes World Development Indicators (WDI) data about the prosperity of countries around the world, with the goal of finding interesting correlations and observations. Additionally, an attempt was made to build a regressor that can predict the price of Bitcoin from this data (together with data on gold prices, S&P Composite values, and currency exchange rates).
The data was gathered mainly by the World Bank and contains more than 200 statistics that measure general market development across most countries in the world.
First, after reading and cleaning the data, each dataset was thoroughly examined. Missing values were filled, and additional filtering was performed to reduce dataset sizes where necessary.
Then, a section focused on finding interesting correlations in the WDI dataset. Correlations were calculated separately for each region as well as for the entire dataset, and similar patterns were found in each. Most of the strong correlations observed were trivial (e.g. the total population growing together with the urban or rural population).
Some were more interesting - like a moderate correlation between the number of under-five deaths (the number of children dying before reaching age five) and the total population. This suggests that medicine is becoming more advanced as the years go by; the global under-five mortality rate has declined by 59 percent since 1990.
Another finding, which is quite alarming, is a very strong correlation between the total population and the amount of the different gases countries produce that drive climate change and global warming.
There was also a brief overview of the effect that the COVID-19 pandemic had on the global economy. It was observed that, on average, GDP dropped by around 5% in 2020.
The last section focused on finding a model that could predict the price of Bitcoin. Two algorithms were tried - Stochastic Gradient Boosting and Quantile Random Forest. Both performed similarly, but the first one was chosen in the end (based on the R squared, RMSE, and MAE metrics).
After analyzing the variable importance it was found that the most important attribute in predicting the price was the difficulty of finding a new block in the blockchain network.
The final results were not entirely satisfying (an RMSE of almost 20000 and an R squared value of around 0.36), and the conclusion was that the dataset does not capture all the factors that affect the market price of Bitcoin.
Additionally, time series modeling with ARIMA was tried and compared with the regressor. It achieved an RMSE of around 1300, but it predicted only the next 20 values (versus the 813 in the regressor's test set), so the comparison is not entirely fair.
set.seed(42)
First, all available datasets are read into memory. To simplify the analysis, some columns are renamed and their types changed. Additionally, the blockchain data was scattered across many files, so they are joined into a single data frame.
ex_rates <- read.csv("data/CurrencyExchangeRates.csv", colClasses = c(Date = "POSIXct"))
gold_prices <- read.csv("data/Gold prices.csv", colClasses = c(Date = "POSIXct"))
sp_composite <- read.csv("data/S&P Composite.csv", colClasses = c(Year = "POSIXct")) %>% rename(Date = Year)
bchain_difficulty <- read.csv("data/Bitcoin/BCHAIN-DIFF.csv", colClasses = c(Date = "POSIXct"))
bchain_hash_rate <- read.csv("data/Bitcoin/BCHAIN-HRATE.csv", colClasses = c(Date = "POSIXct"))
bchain_market_price <- read.csv("data/Bitcoin/BCHAIN-MKPRU.csv", colClasses = c(Date = "POSIXct"))
bchain_trade_volume <- read.csv("data/Bitcoin/BCHAIN-TRVOU.csv", colClasses = c(Date = "POSIXct"))
bchain <- bchain_difficulty %>%
inner_join(bchain_hash_rate, by = "Date") %>%
inner_join(bchain_market_price, by = "Date") %>%
inner_join(bchain_trade_volume, by = "Date") %>%
rename(Difficulty = 2, Hash_rate = 3, Market_price = 4, Trade_volume = 5)
wdi <- read_excel(
path = "data/World_Development_Indicators.xlsx",
sheet = 1,
na = "..",
n_max = 44304
) %>%
rename_with(~ substr(.x, 1, 4), contains("YR"))
The original World Development Indicators data does not include any information about the region a country belongs to. Therefore, an additional dataset was obtained from The World Bank website to simplify the analysis.
class <- read_excel("data/CLASS.xlsx")
wdi <- wdi %>%
left_join(class %>% select(Economy, Region), by = c("Country Name" = "Economy"))
This dataset is still “dirty”. Each year is a separate column, so those columns are “gathered” into rows (so that there is a single “Year” column). Then each indicator value is “spread” into a separate column per indicator, decreasing the overall row count.
wdi <- wdi %>%
gather(key = "Year", value = "Value", "1970":"2020") %>%
mutate(Year = as.numeric(Year)) %>%
select(-`Series Name`) %>%
spread(key = `Series Code`, value = Value)
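As a toy illustration of this reshaping (hypothetical values; only the column layout mirrors the real sheet, with one row per country-indicator pair and one column per year):

```r
library(tidyr)
library(dplyr)

# Wide shape, as in the WDI sheet: one row per (country, indicator).
toy <- data.frame(
  Country = c("A", "A"),
  `Series Code` = c("POP", "GDP"),
  `2019` = c(10, 5),
  `2020` = c(11, 6),
  check.names = FALSE
)

long <- toy %>%
  gather(key = "Year", value = "Value", `2019`:`2020`) %>%
  mutate(Year = as.numeric(Year))

# One row per (country, year); one column per indicator.
wide <- long %>%
  spread(key = `Series Code`, value = Value)

wide  # 2 rows, columns: Country, Year, GDP, POP
```

The gather step quadruples the rows here (2 indicators x 2 years), and the spread step halves them again by turning the two indicator codes into columns.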
This dataset contains information about exchange rates. Prices are presented relative to the USD currency. It has 52 attributes and 5978 observations. Only some currencies are shown for brevity.
| Date | Polish.Zloty | Euro | U.S..Dollar |
|---|---|---|---|
| Min. :1995-01-02 00:00:00 | Min. :2.022 | Min. :0.8252 | Min. :1 |
| 1st Qu.:2000-10-05 06:00:00 | 1st Qu.:3.033 | 1st Qu.:1.0889 | 1st Qu.:1 |
| Median :2006-07-06 12:00:00 | Median :3.290 | Median :1.2295 | Median :1 |
| Mean :2006-07-27 21:33:31 | Mean :3.365 | Mean :1.2076 | Mean :1 |
| 3rd Qu.:2012-05-07 18:00:00 | 3rd Qu.:3.822 | 3rd Qu.:1.3338 | 3rd Qu.:1 |
| Max. :2018-05-02 00:00:00 | Max. :4.500 | Max. :1.5990 | Max. :1 |
| NA | NA’s :1765 | NA’s :1070 | NA |
For further analysis, it is sufficient to focus on a single currency. The Euro is chosen because of its popularity and the fact that it is mostly free from missing values. Most of the missing values fall in years prior to 2000, and the analysis will probably not look that far back. Missing values are filled from neighbouring entries.
ex_rates <- ex_rates %>%
select(Date, Euro) %>%
fill(Euro, .direction = "updown")
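A minimal sketch of how fill() with .direction = "updown" behaves, on toy data: it first fills upwards (leading gaps take the next observed value) and then downwards (trailing gaps take the previous one).

```r
library(tidyr)

df <- data.frame(Euro = c(NA, 1.10, NA, 1.20, NA))

# Up pass: positions 1 and 3 take the next observed value below.
# Down pass: the trailing NA takes the last observed value above.
filled <- fill(df, Euro, .direction = "updown")

filled$Euro  # 1.1 1.1 1.2 1.2 1.2
```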
This dataset contains information about monthly S&P Composite values starting from 1871. There are 1810 records in total.
| Date | S.P.Composite | Dividend | Earnings | CPI | Long.Interest.Rate | Real.Price | Real.Dividend | Real.Earnings | Cyclically.Adjusted.PE.Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Min. :1871-01-31 00:00:00 | Min. : 2.730 | Min. : 0.1800 | Min. : 0.1600 | Min. : 6.28 | Min. : 0.620 | Min. : 73.9 | Min. : 5.445 | Min. : 4.576 | Min. : 4.784 |
| 1st Qu.:1908-10-07 18:00:00 | 1st Qu.: 7.902 | 1st Qu.: 0.4202 | 1st Qu.: 0.5608 | 1st Qu.: 10.20 | 1st Qu.: 3.171 | 1st Qu.: 186.6 | 1st Qu.: 9.417 | 1st Qu.: 14.063 | 1st Qu.:11.898 |
| Median :1946-06-15 00:00:00 | Median : 17.370 | Median : 0.8717 | Median : 1.4625 | Median : 20.35 | Median : 3.815 | Median : 283.3 | Median :14.411 | Median : 23.524 | Median :16.381 |
| Mean :1946-06-15 19:15:13 | Mean : 327.968 | Mean : 6.7321 | Mean : 15.3714 | Mean : 62.39 | Mean : 4.504 | Mean : 622.0 | Mean :17.498 | Mean : 34.907 | Mean :17.215 |
| 3rd Qu.:1984-02-21 18:00:00 | 3rd Qu.: 164.400 | 3rd Qu.: 7.0525 | 3rd Qu.: 14.7258 | 3rd Qu.:102.28 | 3rd Qu.: 5.139 | 3rd Qu.: 707.0 | 3rd Qu.:22.301 | 3rd Qu.: 43.768 | 3rd Qu.:20.913 |
| Max. :2021-10-31 00:00:00 | Max. :4493.280 | Max. :59.6800 | Max. :158.7400 | Max. :273.98 | Max. :15.320 | Max. :4477.2 | Max. :63.511 | Max. :159.504 | Max. :44.198 |
| NA | NA | NA’s :4 | NA’s :4 | NA | NA | NA | NA’s :4 | NA’s :4 | NA’s :120 |
Most columns have no missing values, four columns have just 4, and only one column has 120. They are filled from neighbouring entries.
sp_composite <- sp_composite %>%
fill(Dividend, Earnings, Real.Dividend, Real.Earnings, Cyclically.Adjusted.PE.Ratio, .direction = "updown")
This dataset contains information about Bitcoin’s price over time as well as some blockchain technology attributes:
It has 4659 rows in total.
| Date | Difficulty | Hash_rate | Market_price | Trade_volume |
|---|---|---|---|---|
| Min. :2009-01-03 00:00:00 | Min. :0.000e+00 | Min. : 0 | Min. : 0.00 | Min. :0.000e+00 |
| 1st Qu.:2012-03-12 12:00:00 | 1st Qu.:1.689e+06 | 1st Qu.: 12 | 1st Qu.: 7.21 | 1st Qu.:1.948e+05 |
| Median :2015-05-21 00:00:00 | Median :4.881e+10 | Median : 356089 | Median : 431.89 | Median :6.824e+06 |
| Mean :2015-05-21 00:24:32 | Mean :3.665e+12 | Mean : 26458258 | Mean : 5132.38 | Mean :1.467e+08 |
| 3rd Qu.:2018-07-28 12:00:00 | 3rd Qu.:5.364e+12 | 3rd Qu.: 38265984 | 3rd Qu.: 6496.35 | 3rd Qu.:1.484e+08 |
| Max. :2021-10-05 00:00:00 | Max. :2.505e+13 | Max. :198514006 | Max. :63554.44 | Max. :5.352e+09 |
Since inner join operations were performed shortly after reading the data, there are no missing values here.
This dataset contains information about the price of gold across multiple years, starting from 1968. There are 13585 rows.
| Date | USD..AM. | USD..PM. | GBP..AM. | GBP..PM. | EURO..AM. | EURO..PM. |
|---|---|---|---|---|---|---|
| Min. :1968-01-02 00:00:00 | Min. : 34.77 | Min. : 34.75 | Min. : 14.48 | Min. : 14.48 | Min. : 237.3 | Min. : 236.7 |
| 1st Qu.:1981-06-10 00:00:00 | 1st Qu.: 280.50 | 1st Qu.: 281.50 | 1st Qu.: 177.71 | 1st Qu.: 178.23 | 1st Qu.: 335.3 | 1st Qu.: 335.2 |
| Median :1994-11-14 00:00:00 | Median : 383.32 | Median : 383.50 | Median : 234.51 | Median : 234.96 | Median : 892.6 | Median : 896.1 |
| Mean :1994-11-16 03:40:00 | Mean : 575.20 | Mean : 576.62 | Mean : 370.84 | Mean : 371.81 | Mean : 797.3 | Mean : 797.2 |
| 3rd Qu.:2008-04-23 00:00:00 | 3rd Qu.: 841.94 | 3rd Qu.: 851.50 | 3rd Qu.: 454.32 | 3rd Qu.: 456.43 | 3rd Qu.:1114.1 | 3rd Qu.:1114.9 |
| Max. :2021-09-29 00:00:00 | Max. :2061.50 | Max. :2067.15 | Max. :1574.37 | Max. :1569.59 | Max. :1743.8 | Max. :1743.4 |
| NA | NA’s :1 | NA’s :143 | NA’s :11 | NA’s :154 | NA’s :7837 | NA’s :7880 |
For similar reasons as with the exchange rates dataset, only one currency will be chosen. This time it is the USD..AM. column (the price taken in the morning), because it has only one missing value, which is replaced with the previous entry.
gold_prices <- gold_prices %>%
select(Date, USD..AM.) %>%
fill(USD..AM.)
World Development Indicators (WDI) is the primary World Bank collection of development indicators, compiled from officially recognized international sources. It presents the most current and accurate global development data available and includes national, regional, and global estimates.
WDI contains a large host of socio-economic indicators, from the widely used such as population and GDP to the more esoteric such as the percent of households that consume iodized salt.
This dataset contains information about indicators between the years 1970 and 2020 (10608 rows).
In the table summarizing the data below, only three indicator columns are shown for brevity.
| Country Name | Country Code | Region | Year | AG.LND.PRCP.MM | AG.LND.TOTL.K2 | AG.LND.TOTL.UR.K2 |
|---|---|---|---|---|---|---|
| Length:10608 | Length:10608 | Length:10608 | Min. :1970 | Min. : 18.1 | Min. : 2 | Min. : 0 |
| Class :character | Class :character | Class :character | 1st Qu.:1982 | 1st Qu.: 591.0 | 1st Qu.: 17200 | 1st Qu.: 566 |
| Mode :character | Mode :character | Mode :character | Median :1995 | Median :1071.0 | Median : 107160 | Median : 3395 |
| NA | NA | NA | Mean :1995 | Mean :1195.8 | Mean : 2719581 | Mean : 82862 |
| NA | NA | NA | 3rd Qu.:2008 | 3rd Qu.:1761.0 | 3rd Qu.: 547566 | 3rd Qu.: 15565 |
| NA | NA | NA | Max. :2020 | Max. :3240.0 | Max. :129956634 | Max. :3629312 |
| NA | NA | NA | NA | NA’s :9016 | NA’s :116 | NA’s :10086 |
Since there are many indicators (213 in total) and it would be tedious to focus on all of them, they are filtered: indicators with 20% or more missing values are removed entirely.
wdi <- wdi %>%
select(where(~mean(is.na(.)) < 0.2))
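A toy sketch of the same filter (hypothetical columns), showing that a column sitting exactly at the 20% threshold is dropped as well, because the comparison is strict:

```r
library(dplyr)

toy <- data.frame(
  a = c(1, 2, 3, 4, 5),    # 0% missing
  b = c(1, NA, 3, 4, 5),   # exactly 20% missing
  c = c(NA, NA, 3, 4, 5)   # 40% missing
)

# Keep only columns with strictly less than 20% missing values.
kept <- toy %>% select(where(~ mean(is.na(.x)) < 0.2))

names(kept)  # "a"
```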
With this approach, exactly forty indicators remain. That is still a lot, but correlations between them will be analyzed to choose which ones are worth examining further.
Missing values for the remaining indicators are going to be filled with the same strategy as for previous datasets (with the previous entry).
wdi <- wdi %>%
fill(everything(), .direction = "downup")
We can observe that all graphs share a similar triangle of positive correlation and a similar area of negative correlation (although these are more vivid in North America - most likely due to the uniqueness of the USA). We can therefore conclude that the correlations are analogous across regions, so the following analysis focuses on the correlation matrix for the entire dataset.
Most correlations that we can notice there are trivial, e.g.:
What is quite interesting at first glance is the strong negative correlation (close to -1.0) between males (% of the total population) and females (% of the total population). On closer inspection, however, it is obvious: when the percentage of females increases, the percentage of males has to decrease at the same time, and vice versa. The two values cannot grow together.
On average, there used to be slightly more women than men between the years 1970 and 2009. After that year, it changed. 2009 was also a year when Bitcoin was first introduced. Coincidence?
Another curious finding is the positive correlation (around 0.75, so not that strong) between the number of under-five deaths (children dying before reaching age five) and the total population. One would expect these indicators to grow together more closely (a correlation nearer 1.0), but they do not. It may mean that medicine is becoming more advanced as the years go by; thanks to that, the number of under-five deaths does not grow as rapidly as the total population.
This source claims that the global under-five mortality rate declined by 59 per cent, from 93 deaths per 1,000 live births in 1990 to 38 in 2019.
Another discovery that is not very surprising (but quite alarming) is the strong correlation (close to 1.0) between the total population and the amount of the different gases countries produce that drive climate change and global warming. This is to be expected - the more people there are, the more houses, factories, machines, etc. we produce.
Given that the world’s population is growing at a very rapid rate, greenhouse gas production is going to increase as well.
The charts above show quite a sharp decline in emissions of all gases after 1990. This article in Nature claims it can be attributed to the Soviet Union’s collapse, after which many people stopped eating meat; an estimated one-third of late-Soviet cropland was abandoned in the post-communist economic crisis.
This section presents a quick overview of the effect that the COVID-19 pandemic had on the world’s economy. There is widespread agreement among economists that it has had severe negative impacts on the global economy.
The world, on average, lost around 5 percent of the gross domestic product in 2020 as we can see above.
To put this number in perspective, global GDP was estimated at around 84.54 trillion U.S. dollars in 2020 – meaning that a 4.5 percent drop in economic growth results in almost 2.96 trillion U.S. dollars of lost economic output.
This section focuses on creating a model that can predict the price of Bitcoin.
The dataset used for training the model will consist of the following attributes:
The last four attributes were chosen because they were described in more detail in the previous section about correlations. The USA was selected in particular because it is a major producer of those greenhouse gases.
usa_indicators <- wdi %>%
filter(`Country Name` == 'United States') %>%
select(Year, SP.POP.TOTL, EN.ATM.CO2E.KT, EN.ATM.METH.KT.CE, EN.ATM.NOXE.KT.CE) %>%
rename(US_Population = 2, US_CO2_emission = 3, US_Methane_emission = 4, US_Nitrous_oxide_emission = 5)
data <- bchain %>%
mutate(Year = as.numeric(format(Date, "%Y"))) %>%
left_join(ex_rates, by="Date") %>%
left_join(gold_prices, by="Date") %>%
left_join(sp_composite %>% select(Date, S.P.Composite), by="Date") %>%
left_join(usa_indicators, by="Year") %>%
select(-Year) %>%
arrange(Date) %>%
rename(Gold = USD..AM.)
First, the year 2009 and half of 2010 are discarded, because Bitcoin’s market price in that period is 0. The first day on which the Bitcoin price rises above zero is 2010-08-18, so this is the starting point of the dataset.
After that, there are still some missing values. For example, S.P.Composite is collected only once per month. Those values are filled from nearby entries.
Moreover, for some reason, the trade volume has zero values scattered randomly through the dataset. For example, on 2012-11-27 the trade volume is 389726.0, on 2012-11-28 it is 0, and on 2012-11-29 it is back at 331048.97. Hence, those values are also replaced with values from nearby entries.
data <- data %>%
filter(Date >= "2010-08-18") %>%
fill(everything(), .direction = "updown") %>%
mutate(Trade_volume = na_if(Trade_volume, 0)) %>%
fill(Trade_volume, .direction = "updown") %>%
select(-Date)
The summary of the final dataset is presented in the table below:
| Difficulty | Hash_rate | Market_price | Trade_volume | Euro | Gold | S.P.Composite | US_Population | US_CO2_emission | US_Methane_emission | US_Nitrous_oxide_emission |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. :5.120e+02 | Min. : 0 | Min. : 0.06 | Min. :5.100e+01 | Min. :1.036 | Min. :1051 | Min. :1087 | Min. :309327143 | Min. :4813720 | Min. :609200 | Min. :242640 |
| 1st Qu.:1.215e+07 | 1st Qu.: 112 | 1st Qu.: 104.04 | 1st Qu.:1.313e+06 | 1st Qu.:1.181 | 1st Qu.:1250 | 1st Qu.:1629 | 1st Qu.:316059947 | 1st Qu.:4950210 | 1st Qu.:617170 | 1st Qu.:246480 |
| Median :1.635e+11 | Median : 1197070 | Median : 610.38 | Median :1.539e+07 | Median :1.201 | Median :1330 | Median :2094 | Median :323071755 | Median :4981300 | Median :620810 | Median :249290 |
| Mean :4.198e+12 | Mean : 30309571 | Mean : 5879.46 | Mean :1.681e+08 | Mean :1.231 | Mean :1427 | Mean :2273 | Mean :321503885 | Mean :5006739 | Mean :620646 | Mean :248500 |
| 3rd Qu.:6.379e+12 | 3rd Qu.: 45759634 | 3rd Qu.: 7323.06 | 3rd Qu.:1.856e+08 | 3rd Qu.:1.312 | 3rd Qu.:1621 | 3rd Qu.:2785 | 3rd Qu.:326838199 | 3rd Qu.:5089500 | 3rd Qu.:622590 | 3rd Qu.:250060 |
| Max. :2.505e+13 | Max. :198514006 | Max. :63554.44 | Max. :5.352e+09 | Max. :1.488 | Max. :2062 | Max. :4493 | Max. :329484123 | Max. :5392870 | Max. :649940 | Max. :254860 |
To begin with, train and test sets have to be created. Because we are dealing with a time series, regular stratified partitioning with the createDataPartition function would not be appropriate. Therefore, manual splitting is performed (on a dataset already sorted by date, ascending). An 80/20 split is created, and the train set will be further divided into train and validation sets.
breaking_point_index <- nrow(data) * 0.8
train <- data[1:breaking_point_index,]
test <- data[(breaking_point_index + 1):nrow(data),]
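One caveat worth noting: nrow(data) * 0.8 need not be an integer, and R silently truncates fractional indices. A minimal sketch (toy data, hypothetical helper) with an explicit floor() makes the cut-off predictable:

```r
# Chronological 80/20 split with an explicit integer cut-off.
# floor() makes the truncation of fractional indices explicit.
chrono_split <- function(df, train_frac = 0.8) {
  cut <- floor(nrow(df) * train_frac)
  list(
    train = df[seq_len(cut), , drop = FALSE],
    test  = df[(cut + 1):nrow(df), , drop = FALSE]
  )
}

toy <- data.frame(x = 1:10)
parts <- chrono_split(toy)

nrow(parts$train)  # 8
nrow(parts$test)   # 2
```

Because the rows are already sorted by date, taking the first 80% keeps all training observations strictly earlier than the test observations.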
There are 3253 rows in the training set and 813 in the test set.
As mentioned earlier, we are dealing with a time series, so ordinary cross-validation and similar methods would not be correct: they would mix values from different dates between the training and validation sets. Luckily, the trainControl function offers a “timeslice” method that uses a rolling forecasting origin, moving the training and validation sets forward in time.
The parameters are specified so that 8 windows are created in total, each with a training window of 800 rows and a validation window of 200 rows.
fitControl <- trainControl(method = "timeslice", fixedWindow = T, initialWindow = 800, horizon = 200, skip = 300)
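As a sanity check on the “8 windows” claim, the slice count can be sketched in base R, assuming the same thinning logic as caret's createTimeSlices: candidate window ends run from initialWindow to n - horizon, and skip drops all but every (skip + 1)-th of them.

```r
# Count the train/validation windows produced by "timeslice"
# resampling, mirroring caret's createTimeSlices() behaviour.
n_slices <- function(n, initial_window, horizon, skip) {
  stops <- initial_window:(n - horizon)
  length(seq(1, length(stops), by = skip + 1))
}

n_slices(3253, initial_window = 800, horizon = 200, skip = 300)  # 8
```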
First, let’s try training using Stochastic Gradient Boosting.
gbmFit <- train(Market_price ~ ., data = train, method = "gbm", trControl = fitControl, verbose = F)
gbmFit
## Stochastic Gradient Boosting
##
## 3253 samples
## 10 predictor
##
## No pre-processing
## Resampling: Rolling Forecasting Origin Resampling (200 held-out with a fixed window)
## Summary of sample sizes: 800, 800, 800, 800, 800, 800, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 1391.364 0.3501754 1184.469
## 1 100 1294.541 0.2695565 1085.826
## 1 150 1274.989 0.2955726 1066.256
## 2 50 1357.842 0.3599908 1135.415
## 2 100 1344.389 0.3819946 1120.356
## 2 150 1332.959 0.3838608 1108.177
## 3 50 1352.317 0.3984152 1123.858
## 3 100 1346.252 0.4420979 1117.584
## 3 150 1357.742 0.4455119 1134.790
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 1, shrinkage = 0.1 and n.minobsinnode = 10.
Another algorithm that will be tested is Quantile Random Forest.
qrfFit <- train(Market_price ~ ., data = train, method = "qrf", trControl = fitControl)
qrfFit
## Quantile Random Forest
##
## 3253 samples
## 10 predictor
##
## No pre-processing
## Resampling: Rolling Forecasting Origin Resampling (200 held-out with a fixed window)
## Summary of sample sizes: 800, 800, 800, 800, 800, 800, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1419.542 0.2962583 1192.269
## 6 1452.531 0.3321094 1235.957
## 10 1501.793 0.3169855 1268.325
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
R squared is not visible because of the scale of the axis, so let’s plot it separately.
As we can see, Stochastic Gradient Boosting performs slightly better in general, and it is chosen as the final model.
Not surprisingly, the most important attribute is the one related directly to blockchain technology - difficulty. This matches intuition: the higher the Bitcoin price, the more people are interested in it, the more miners there are, and the higher the difficulty of finding a new block grows.
What can be surprising is that the S.P.Composite attribute is more important than the two remaining blockchain attributes (trade volume and hash rate).
What is quite remarkable here is that US CO2 emissions are a better predictor of Bitcoin’s price than the Euro or gold price. According to the plot, the latter play no role in predicting the price (the same goes for the total US population and US methane and nitrous oxide emissions).
For measuring the performance of the best model selected on the test set, some standard metrics for regression are used:
## RMSE Rsquared MAE
## 1.975857e+04 3.584703e-01 1.255453e+04
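These metrics can be reproduced by hand. A minimal base-R sketch with toy numbers (assuming caret's postResample convention of computing R squared as the squared correlation between observed and predicted values):

```r
# Hand-rolled versions of the three metrics reported above.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae  <- function(obs, pred) mean(abs(obs, pred - pred) + abs(obs - pred) * 0)  # placeholder, see below
mae  <- function(obs, pred) mean(abs(obs - pred))
rsq  <- function(obs, pred) cor(obs, pred)^2   # squared correlation, as in caret

obs  <- c(100, 200, 300, 400)
pred <- c(110, 190, 320, 380)

rmse(obs, pred)  # 15.81139
mae(obs, pred)   # 15
rsq(obs, pred)   # ~0.98
```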
As the table above shows, the results are not extraordinary. Even though the R squared value on the test set is similar to the values observed during training, the RMSE is considerably higher. The reason is most likely the huge increase in Bitcoin’s price in 2021, which was not available to the training set. A potential solution would be to include more data so that the model could “see” that rapid growth.
Predicting the price of Bitcoin is hard because many more factors affect it than were captured in this report. For example, even something as trivial as a positive tweet about cryptocurrency from Elon Musk can have a tremendous effect on the market price.
We can also try some time series modeling with ARIMA.
First, a dataset is created and split into train and test sets, using the same rules as in the previous section. We do not need the additional features used when creating the regressor, so no additional tables are joined; only the date and market price columns are kept. We also start at 2010-08-18, the first day on which the Bitcoin price rises above 0.
data <- bchain %>%
select(Date, Market_price) %>%
filter(Date >= "2010-08-18") %>%
arrange(Date)
arima_train <- data[1:breaking_point_index, ]
arima_test <- data[(breaking_point_index + 1):nrow(data), ]
Now, since the Arima function only accepts univariate time series, we have to perform some transformations. The zoo package has a built-in function for converting to a time series vector. We create a time series that changes daily over the specified date range.
arima_train_ts <-
zoo(arima_train$Market_price,
seq(
from = as.Date("2010-08-18"),
to = as.Date("2019-07-14"),
by = 1
))
arima_test_ts <-
zoo(arima_test$Market_price,
seq(
from = as.Date("2019-07-15"),
to = as.Date("2021-10-04"),
by = 1
))
Additionally, as the Bitcoin price plot showed, the data is not stationary, which is a requirement for the ARIMA model. To deal with this, we can difference the prices to make the series stationary (by specifying the d parameter - the differencing order - in the Arima() function).
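A toy base-R sketch of why d = 1 helps: a random walk is non-stationary (its variance grows with time), but its first differences recover the stationary noise, which is exactly what d = 1 asks Arima() to work with.

```r
set.seed(42)

# A random walk: cumulative sum of i.i.d. noise.
walk  <- cumsum(rnorm(500))

# First differences recover the underlying stationary noise.
diffs <- diff(walk)

length(diffs)            # 499: one observation is lost per difference
var(walk) > var(diffs)   # TRUE: the trend-driven variance is gone
```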
Now, we can fit the ARIMA model by specifying its three parameters (p, d, and q). Luckily, the auto.arima function can test many combinations of those parameters for us.
auto.arima(arima_train_ts, trace = T, d = 1)
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,1,2) with drift : 43448.74
## ARIMA(0,1,0) with drift : 43497.45
## ARIMA(1,1,0) with drift : 43492.02
## ARIMA(0,1,1) with drift : 43491.35
## ARIMA(0,1,0) : 43496.5
## ARIMA(1,1,2) with drift : 43481.31
## ARIMA(2,1,1) with drift : 43487.25
## ARIMA(3,1,2) with drift : 43459.53
## ARIMA(2,1,3) with drift : 43431.39
## ARIMA(1,1,3) with drift : 43462.41
## ARIMA(3,1,3) with drift : Inf
## ARIMA(2,1,4) with drift : 43411.11
## ARIMA(1,1,4) with drift : 43452.07
## ARIMA(3,1,4) with drift : 43407.52
## ARIMA(4,1,4) with drift : Inf
## ARIMA(3,1,5) with drift : 43408.85
## ARIMA(2,1,5) with drift : 43407.98
## ARIMA(4,1,3) with drift : 43411.28
## ARIMA(4,1,5) with drift : Inf
## ARIMA(3,1,4) : 43406.41
## ARIMA(2,1,4) : 43409.93
## ARIMA(3,1,3) : Inf
## ARIMA(4,1,4) : Inf
## ARIMA(3,1,5) : 43407.77
## ARIMA(2,1,3) : 43430.26
## ARIMA(2,1,5) : 43406.9
## ARIMA(4,1,3) : 43410.15
## ARIMA(4,1,5) : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(3,1,4) : 43414.21
##
## Best model: ARIMA(3,1,4)
## Series: arima_train_ts
## ARIMA(3,1,4)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 ma4
## 0.1212 -0.7051 -0.3882 -0.0694 0.7423 0.3496 0.1212
## s.e. 0.1099 0.0564 0.1090 0.1098 0.0565 0.1068 0.0192
##
## sigma^2 estimated as 36654: log likelihood=-21699.08
## AIC=43414.17 AICc=43414.21 BIC=43462.86
Finally, we can fit the model using the obtained parameters and evaluate the performance by predicting the next 20 values.
fitARIMA <- Arima(arima_train_ts, order = c(3, 1, 4))
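As a hedged sketch of the same fit-then-forecast workflow, here is base R's stats::arima with predict() on simulated data. The order (1,1,0) is chosen only so the toy example fits cleanly; the actual model above uses the forecast package's Arima with order (3,1,4).

```r
set.seed(42)

# Simulate an integrated AR(1) series, fit an ARIMA(1,1,0),
# and forecast the next 20 values, mirroring the evaluation above.
y   <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.5), n = 300)
fit <- arima(y, order = c(1, 1, 0))
fc  <- predict(fit, n.ahead = 20)

length(fc$pred)  # 20 point forecasts (fc$se holds their standard errors)
```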
|  | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 3.184154 | 191.2157 | 57.46084 | 0.086365 | 3.716523 | 1.017451 | -0.0013748 |
| Test set | -1188.537915 | 1281.3689 | 1188.53791 | -12.043664 | 12.043664 | 21.045272 | NA |
Although the RMSE is smaller than that of the regressor created in the previous section, the comparison is not entirely fair. The regressor predicted values for the entire test set (813 values), while the ARIMA model predicted only the next 20 days. That also means it was never evaluated on the rapid price growth of 2021.
This quick experiment showed that it can make sense to consider a standard time series modeling algorithm, which is faster to implement than a typical regression model - or at least to use it as a baseline against which other models are compared.